the siren song of backpressure

2021-06-02 · 6 min read

The Pieces #

Three major components compose an async system: (1) Actors, which contain (2) Event Loops, which communicate with other Actors using (3) Channels.
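As a minimal sketch of that shape (assuming tokio; the Msg type and Actor name here are illustrative, not from any real codebase):

```rust
use tokio::sync::mpsc;

// An illustrative message type; real actors have richer enums.
#[derive(Debug)]
enum Msg {
    Ping,
}

// (1) An Actor owns its state plus the receiving end of a channel.
struct Actor {
    inbox: mpsc::Receiver<Msg>,
}

impl Actor {
    // (2) The Event Loop: pull messages off the inbox and handle them.
    async fn run(mut self) {
        while let Some(msg) = self.inbox.recv().await {
            match msg {
                Msg::Ping => { /* handle the message */ }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // (3) A Channel connects this actor to other actors.
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(Actor { inbox: rx }.run());
    tx.send(Msg::Ping).await.unwrap();
}
```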

Backpressure #

A real-world analogy for backpressure is a plumbing pipe. When the pipe is full, it admits no more water and "exerts backpressure" on its inputs, which in turn start filling up because they can't push their water forward. All the water sent into the pipe eventually ends up at the output.

Backpressure is compositional: a series of simple pipes hooked end-to-end itself exerts backpressure. We can break the total system backpressure with a leaky pipe in the middle, which loses some of the water.

In systems terms, a pipe is a bounded FIFO queue.
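In tokio terms, that's just a bounded mpsc channel; a sketch (the capacity is arbitrary):

```rust
use tokio::sync::mpsc;

async fn demo() -> Option<u32> {
    // A bounded FIFO queue ("pipe") with room for 8 items.
    let (tx, mut rx) = mpsc::channel::<u32>(8);

    // When the queue is full, `send().await` suspends the sender until
    // the receiver drains an item -- that suspension is backpressure.
    tx.send(42).await.ok()?;
    rx.recv().await
}
```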

The Allure of Backpressure #

A system designed with all parts exerting backpressure promises: "use these simple tricks and your system will degrade gracefully under load! Lose no messages! Never OOM!"

A local system that fully exerts backpressure end-to-end provides a natural mechanism for propagating that backpressure to remote peers via TCP connections.

If an internal component is at capacity, the backpressure will eventually propagate to our local TCP connection for each peer, and from there across the network to each remote peer's TCP socket. In other words, when we're under load, the system will naturally stop accepting new requests off the socket.

In the beginning, most advice looks like "Every queue should have a maximum size. Queues must not grow unbounded."

The average async fan replaces all their unbounded channels with bounded channels and thinks they're done.

The average async enjoyer ...

Deadlocks #

Problems start to arise when loop branches or handlers need to await some shared, limited resource, particularly when the system is under load. The most common culprits are

  1. enqueueing messages onto bounded mpsc (multi-producer, single-consumer) channels (for communication between actors),
  2. acquiring permits from concurrency-limiting semaphores (to limit concurrent requests or spawned tasks), and
  3. sending messages on a socket.

When there's a wait cycle (e.g. task A waits on task B waits on task C waits on task A), the system deadlocks. Especially problematic is that deadlock issues tend to only crop up non-deterministically when the system is under load and the bounded queues start hitting capacity.
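Here's a contrived sketch of the first case, assuming tokio, with channel capacities shrunk to 1 so the wait cycle is easy to hit (the actor wiring is illustrative):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Two actors connected in a cycle with capacity-1 channels.
    let (a_tx, mut a_rx) = mpsc::channel::<u64>(1);
    let (b_tx, mut b_rx) = mpsc::channel::<u64>(1);

    // Actor A forwards everything from its inbox to B.
    let b_tx2 = b_tx.clone();
    tokio::spawn(async move {
        while let Some(n) = a_rx.recv().await {
            // If B's queue is full, A parks here and stops
            // draining its own inbox...
            b_tx2.send(n + 1).await.unwrap();
        }
    });

    // Actor B forwards everything from its inbox to A.
    let a_tx2 = a_tx.clone();
    tokio::spawn(async move {
        while let Some(n) = b_rx.recv().await {
            // ...while B parks here waiting on A: a wait cycle.
            a_tx2.send(n + 1).await.unwrap();
        }
    });

    // Flood both inboxes. Under enough load both queues fill, both
    // event loops block in `send().await`, and the system deadlocks.
    for n in 0.. {
        a_tx.send(n).await.unwrap();
        b_tx.send(n).await.unwrap();
    }
}
```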

If your queue sizes are large and you're not running load tests, these problems will manifest in production at 3 am during your on-call shift.

Handling backpressure properly is difficult. Throwing bounded queues everywhere and praying is easy, but in my experience it is not sufficient.

In fact, many systems decide to eschew handling backpressure internally altogether. Erlang actors communicate exclusively over unbounded channels. Erlang also doesn't bound spawned processes (until the node crashes or OOMs, of course :)).

This approach almost completely avoids deadlocks at the cost of other problems like unbounded resource usage and exploding tail latencies when the system is under load. The request latency is the real killer; the system becomes unusable when all requests start timing out as the mean request service time exceeds the mean timeout.

In this sort of system, a typical approach is to bound external requests "at the edges," with the expectation that internal load will remain proportionally bounded. This could mean setting external request rate limits and total in-flight request limits at your load balancers, and hoping that internal system utilization remains nominal.
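One common shape for that edge limit is a semaphore capping total in-flight work; a sketch assuming tokio (the serve/handle names and the limit of 1024 are illustrative):

```rust
use std::sync::Arc;
use tokio::net::{TcpListener, TcpStream};
use tokio::sync::Semaphore;

// Hypothetical request handler; stands in for the real application.
async fn handle(_socket: TcpStream) { /* ... */ }

// Cap total in-flight requests at the edge, so internal load stays
// proportionally bounded even if internals don't exert backpressure.
async fn serve(listener: TcpListener) {
    let limit = Arc::new(Semaphore::new(1024));
    loop {
        // Stop accepting new connections once at capacity; the
        // kernel's TCP accept queue then pushes back on peers.
        let permit = limit.clone().acquire_owned().await.unwrap();
        let (socket, _addr) = listener.accept().await.unwrap();
        tokio::spawn(async move {
            handle(socket).await;
            drop(permit); // free the slot when the request completes
        });
    }
}
```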

The "ideal" way to avoid deadlocks with bounded channels is to design your system topology so there can never be wait cycles, i.e., all Actor communication follows a directed acyclic graph (DAG). Every node in this graph is an async event loop. Every directed edge in this graph is a bounded FIFO queue that starts from an mpsc::Sender and ends at an mpsc::Receiver. In a more perfect world, the Diem topology would be a DAG like:

```mermaid
flowchart LR;
    I[Inbound Network Messages] --> M & C & S;
    C[Consensus] --> S & M & O;
    S[State Sync] --> M & O;
    M[Shared Mempool] --> O;
    O[Outbound Network Messages];
```

Unfortunately, the system grew organically, and the topology currently looks something like this (heavily simplified):

```mermaid
flowchart LR;
    O[Sockets x N] --> P;
    P[Peers x N] --> O & N;
    N[Peer Manager] --> P & C & S & M;
    M[Shared Mempool] --> N;
    C[Consensus] --> S & M & N;
    S[State Sync] --> C & M & N;
```

Wait a second. Didn't we just advise against cycles? How do you avoid deadlocks??

Here's a dirty little secret... Diem doesn't actually use bounded FIFO queues in most cases.

There's a lossy multi-queue implementation called diem_channel (very creative name, I know) which, if you squint a bit, looks like an Arc<Mutex<HashMap<K, VecDeque<V>>>>. Its most common use is actually to provide fairness and quality-of-service across multiple peers and application protocols (Consensus, State Sync, Shared Mempool): each (PeerId, ProtocolId) pair gets its own exclusive mini-queue, separate from the others, and the receiver round-robins across all the mini-queues when dequeuing to provide fairness. If one peer starts sending our node lots of messages, only that peer's queue fills up; the others are unaffected.

These diem_channels back almost all of the edges above. Rather than blocking the sender, these queues actually drop messages when at capacity, according to an eviction policy.
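To make the idea concrete, here's a heavily simplified, non-async sketch of a lossy multi-queue with round-robin dequeuing. This is my own illustration of the technique, not the actual diem_channel API, and evicting the oldest message is just one possible eviction policy:

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// A lossy multi-queue: one bounded mini-queue per key, with
/// round-robin dequeuing across keys for fairness.
struct MultiQueue<K, V> {
    queues: HashMap<K, VecDeque<V>>,
    per_key_capacity: usize, // assumed >= 1
    round_robin: VecDeque<K>, // keys with pending messages
}

impl<K: Clone + Eq + Hash, V> MultiQueue<K, V> {
    fn new(per_key_capacity: usize) -> Self {
        Self {
            queues: HashMap::new(),
            per_key_capacity,
            round_robin: VecDeque::new(),
        }
    }

    /// Enqueue without ever blocking. At capacity, evict the oldest
    /// message for that key instead of exerting backpressure.
    fn push(&mut self, key: K, value: V) {
        let q = self.queues.entry(key.clone()).or_default();
        if q.is_empty() {
            self.round_robin.push_back(key);
        }
        if q.len() == self.per_key_capacity {
            q.pop_front(); // drop rather than block the sender
        }
        q.push_back(value);
    }

    /// Dequeue from the next key in round-robin order, so one
    /// chatty peer can't starve the others.
    fn pop(&mut self) -> Option<V> {
        let key = self.round_robin.pop_front()?;
        let q = self.queues.get_mut(&key)?;
        let value = q.pop_front();
        if !q.is_empty() {
            self.round_robin.push_back(key);
        }
        value
    }
}
```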

More generally, dropping messages is the alternative to backpressure: you can't deadlock if you never block! Dropping messages has a price, however: protocols become more difficult to develop, since they need to be robust against messages dropping at any time.

The Diem Consensus protocol even uses size=1 LIFO message "slots" for each peer, since the protocol only cares about the absolute most recent message (when optimized for min latency).
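tokio's watch channel behaves like such a slot: writers overwrite the single cell, and the reader only ever observes the most recent value. A sketch (the ProposalMsg type here is illustrative, not the real consensus message):

```rust
use tokio::sync::watch;

#[derive(Default)]
struct ProposalMsg {
    round: u64,
}

#[tokio::main]
async fn main() {
    // A size-1 "slot": each send overwrites whatever is in it, so
    // the receiver only ever sees the latest message.
    let (tx, mut rx) = watch::channel(ProposalMsg::default());

    tx.send(ProposalMsg { round: 1 }).unwrap();
    tx.send(ProposalMsg { round: 2 }).unwrap(); // overwrites round 1

    rx.changed().await.unwrap();
    assert_eq!(rx.borrow_and_update().round, 2); // round 1 was dropped
}
```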

Note that without backpressure, you don't have an obvious way to tell a peer to "backoff" and stop sending you messages when your system is under load. You must provide your own rate limits or ad-hoc protocol backoff messages.

It might not surprise you that distributed systems are not simple, and backpressure is no exception. If there's anything you should take away from this article, it's that simple rules like "bounded FIFO queues everywhere" are just heuristics; follow them with intention and an understanding of the tradeoffs involved.